
Add nvidia-cdi-refresh service #1076


Open
wants to merge 1 commit into main from the refresh_cdi branch

Conversation

ArangoGutierrez (Collaborator) commented May 12, 2025

This pull request introduces a new systemd-based service to refresh the NVIDIA Container Device Interface (CDI) specification file upon NVIDIA driver installation or uninstallation. It also updates packaging scripts for Debian and RPM-based distributions to include and configure this service. Below are the most significant changes grouped by theme:

This systemd service covers static package operations:

  • first-time installation
  • upgrades
  • reboots

It does not handle:

  • driver removal (intentionally; we decided not to add logic for removing the CDI file)
  • runtime topology changes (MIG, hot-plug, module unload/load)

New Systemd Service for CDI Refresh:

  • Added nvidia-cdi-refresh.service to execute the CDI refresh using nvidia-ctk and nvidia-cdi-refresh.path to monitor driver installation/uninstallation events. These files ensure the service is triggered when driver-related files change. (deployments/systemd/nvidia-cdi-refresh.service, deployments/systemd/nvidia-cdi-refresh.path) [1] [2]
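For orientation, here is a minimal sketch of what the two units might look like, assembled from fragments quoted later in this review; the .path unit's Description and [Install] target are assumptions, and the shipped units may differ:

# nvidia-cdi-refresh.service (sketch)
[Unit]
Description=Refresh NVIDIA CDI specification file
# Only run when the NVIDIA userspace tools are present.
ConditionPathExists=/usr/bin/nvidia-smi

[Service]
Type=oneshot
# The delay gives the driver package installation time to finish before the spec is generated.
ExecStartPre=/bin/sleep 30
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml

# nvidia-cdi-refresh.path (sketch)
[Unit]
Description=Trigger CDI refresh on NVIDIA driver install or upgrade

[Path]
# Watch the module dependency index that depmod rewrites during a driver install/upgrade.
PathChanged=/lib/modules/%v/modules.dep.bin

[Install]
WantedBy=multi-user.target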

Updates to Dockerfiles:

  • Modified Dockerfiles for multiple distributions (debian, ubuntu, opensuse-leap, rpm-yum) to include the new systemd service and related files in the build process. (docker/Dockerfile.debian, docker/Dockerfile.ubuntu, docker/Dockerfile.opensuse-leap, docker/Dockerfile.rpm-yum) [1] [2] [3] [4]

Debian Packaging Changes:

  • Introduced a new Debian package nvidia-container-toolkit-cdi-refresh to install the systemd service and path files, along with a post-installation script to enable and start the service on installation. (packaging/debian/control, packaging/debian/nvidia-container-toolkit-cdi-refresh.install, packaging/debian/nvidia-container-toolkit-cdi-refresh.postinst) [1] [2] [3]

RPM Packaging Changes:

  • Updated the RPM spec file to include the new systemd service and path files, ensuring proper installation, configuration, and enabling of the service during package installation. (packaging/rpm/SPECS/nvidia-container-toolkit.spec) [1] [2] [3]

@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot May 12, 2025 13:28
@ArangoGutierrez ArangoGutierrez self-assigned this May 12, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces an NVIDIA CDI refresh service to automatically update the NVIDIA CDI specification when system events occur, alongside packaging and integration changes for Debian, RPM, and Docker build processes.

  • Added new systemd unit files and udev rules to trigger the CDI refresh service.
  • Updated packaging scripts for both Debian and RPM distributions and modified Dockerfiles to include new deployment artifacts.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Summary per file:
  • packaging/rpm/SPECS/nvidia-container-toolkit.spec – Added Source entries and installation steps for the new service components.
  • packaging/debian/nvidia-container-toolkit-cdi-refresh.postinst – Introduced post-installation steps for Debian packaging.
  • packaging/debian/nvidia-container-toolkit-cdi-refresh.install – Listed new deployment files for Debian.
  • packaging/debian/control – Added a new control entry for the CDI refresh service package.
  • docker/Dockerfile.* – Updated Dockerfiles to copy new systemd and udev files.
  • deployments/udev/60-nvidia-cdi-refresh.rules – New udev rules to trigger the service on NVIDIA events.
  • deployments/systemd/nvidia-cdi-refresh.service – Defined the one-shot systemd service to refresh the CDI spec.
  • deployments/systemd/nvidia-cdi-refresh.path – Defined the systemd path unit to monitor module changes.
Comments suppressed due to low confidence (1)

packaging/debian/control:33

  • The package name 'nvidia-container-toolkit-cdi-refresh service' contains a space which might cause issues; consider renaming it to 'nvidia-container-toolkit-cdi-refresh-service' for consistency.
Package: nvidia-container-toolkit-cdi-refresh service

@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from 09aee84 to c366578 Compare May 12, 2025 13:47
# limitations under the License.

[Unit]
Description=Refresh NVIDIA OCI CDI specification file
Member

Suggested change
Description=Refresh NVIDIA OCI CDI specification file
Description=Refresh NVIDIA CDI specification file

Collaborator Author

Done

[Service]
Type=oneshot
# The 30-second delay ensures that dependent services or resources are fully initialized.
# If the rationale for this delay is unclear, consider evaluating whether a shorter delay is sufficient.
Member

Suggested change
# If the rationale for this delay is unclear, consider evaluating whether a shorter delay is sufficient.

Collaborator Author

Done

@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from d2bb23d to 077bd25 Compare May 12, 2025 14:05
@ArangoGutierrez ArangoGutierrez requested a review from elezar May 12, 2025 14:05
[Service]
Type=oneshot
# The 30-second delay ensures that dependent services or resources are fully initialized.
ExecStartPre=/bin/sleep 30
Member

Does this mean that we have a 30 second sleep after EVERY event?

Collaborator Author (@ArangoGutierrez, May 12, 2025)

Yes, the only event that actually needs this is install/uninstall.
During an apt install cuda-drivers, it can take from 10 to 30 seconds from the first line of this log to the last:

depmod....
Setting up libnvidia-decode-575:amd64 (575.51.03-0ubuntu1) ...
Setting up libnvidia-compute-575:amd64 (575.51.03-0ubuntu1) ...
Setting up libnvidia-encode-575:amd64 (575.51.03-0ubuntu1) ...
Setting up nvidia-utils-575 (575.51.03-0ubuntu1) ...
Setting up nvidia-compute-utils-575 (575.51.03-0ubuntu1) ...
Setting up libnvidia-gl-575:amd64 (575.51.03-0ubuntu1) ...
Setting up nvidia-driver-575 (575.51.03-0ubuntu1) ...
Setting up cuda-drivers-575 (575.51.03-0ubuntu1) ...
Setting up cuda-drivers (575.51.03-0ubuntu1) ...
Processing triggers for mailcap (3.70+nmu1ubuntu1) ...
Processing triggers for desktop-file-utils (0.26-1ubuntu3) ...
Processing triggers for initramfs-tools (0.140ubuntu13.4) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-136-generic
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast
Processing triggers for gnome-menus (3.36.0-1ubuntu3) ...
Processing triggers for libc-bin (2.35-0ubuntu3.9) ...
Processing triggers for man-db (2.10.2-1) ...
Processing triggers for dbus (1.12.20-2ubuntu4.1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

So 30 seconds is a safe amount of time to wait for the full DEB install to complete.

Member

Why do we not have an additional package on rpm-based systems?

Collaborator Author

You mean adding the service install as part of the regular RPM install script?

Member

In the Debian packages you added an nvidia-container-toolkit-cdi-refresh package that includes the systemd units and udev rules. This seems to be missing from the RPM packages.

Collaborator Author

I see your point. On the DEB side we have separated specific components into individual packages, but on the RPM side we only have one file (the RPM package spec), so I followed that structure and added the installation of the three new files to the same spec file:

packaging
├── debian
│   ├── changelog.old
│   ├── compat
│   ├── control
│   ├── copyright
│   ├── nvidia-container-toolkit-base.install
│   ├── nvidia-container-toolkit-base.postinst
│   ├── nvidia-container-toolkit-cdi-refresh.install
│   ├── nvidia-container-toolkit-cdi-refresh.postinst
│   ├── nvidia-container-toolkit-operator-extensions.install
│   ├── nvidia-container-toolkit.install
│   ├── nvidia-container-toolkit.lintian-overrides
│   ├── nvidia-container-toolkit.postinst
│   ├── nvidia-container-toolkit.postrm
│   ├── prepare
│   └── rules
└── rpm
    ├── SOURCES
    │   └── LICENSE
    └── SPECS
        └── nvidia-container-toolkit.spec

Member

The same RPM spec can be used to generate separate packages. We generate the following packages:

  • nvidia-container-toolkit
  • nvidia-container-toolkit-base
  • nvidia-container-toolkit-operator-extensions

These are defined in the SAME spec file.

In the case of the nvidia-container-toolkit-base package the relevant section is:

# The BASE package consists of the NVIDIA Container Runtime and the NVIDIA Container Toolkit CLI.
# This allows the package to be installed on systems where no NVIDIA Container CLI is available.
%package base
Summary: NVIDIA Container Toolkit Base
Obsoletes: nvidia-container-runtime <= 3.5.0-1, nvidia-container-runtime-hook <= 1.4.0-2
Provides: nvidia-container-runtime
# Since this package allows certain components of the NVIDIA Container Toolkit to be installed separately
# it conflicts with older versions of the nvidia-container-toolkit package that also provide these files.
Conflicts: nvidia-container-toolkit <= 1.10.0-1

%description base
Provides tools such as the NVIDIA Container Runtime and NVIDIA Container Toolkit CLI to enable GPU support in containers.

%files base
%license LICENSE
%{_bindir}/nvidia-container-runtime
%{_bindir}/nvidia-ctk
%{_bindir}/nvidia-cdi-hook

Assuming you wanted to have a separate nvidia-container-toolkit-cdi-refresh package as you did for the Debian case, you would have to add a matching section to the SPEC file for the package.

I don't think that we should have a separate package since we want to install this by default.
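For illustration only, a subpackage section along those lines (modeled on the base example above, with hypothetical names; this approach was ultimately not taken) might look like:

%package cdi-refresh
Summary: NVIDIA Container Toolkit CDI refresh units
Requires: %{name}-base = %{version}-%{release}

%description cdi-refresh
Installs systemd units that regenerate the CDI specification when the NVIDIA driver is installed or upgraded.

%files cdi-refresh
%{_sysconfdir}/systemd/system/nvidia-cdi-refresh.service
%{_sysconfdir}/systemd/system/nvidia-cdi-refresh.path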

Type=oneshot
# The 30-second delay ensures that dependent services or resources are fully initialized.
ExecStartPre=/bin/sleep 30
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
Member

Question: Should this path depend on the installation locations?

Collaborator Author

Good point, yes. Hmm, let me think about how to modify this.

Collaborator Author

Hmm, after checking, we have /usr/bin set as the default for both deb and rpm, and we don't provide macros to change the install path. So having this hardcoded here looks OK. WDYT?

Member

The install command installs to:

install -m 755 -t %{buildroot}%{_bindir} nvidia-ctk

If this is always /usr/bin then we don't need to update this.

Collaborator Author

According to the documentation, we can leave it as is. - https://docs.fedoraproject.org/en-US/packaging-guidelines/RPMMacros/#macros_installation
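(As a quick sanity check on a build host, the macro expansion can be printed directly; on standard builds this prints /usr/bin:)

rpm --eval '%{_bindir}'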

@ArangoGutierrez ArangoGutierrez requested a review from elezar May 12, 2025 14:24
@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 5 times, most recently from a641d38 to d150c66 Compare May 16, 2025 16:42
@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 3 times, most recently from d8b0cb2 to 5ffb29b Compare May 20, 2025 08:30

Package: nvidia-container-toolkit-cdi-refresh
Architecture: any
Depends: ${misc:Depends}, nvidia-container-toolkit-base (= @VERSION@)
Member

This means that this will not be installed by default. Should we just include the Systemd units in the -base package?

Collaborator Author

Yes, not by default. It probably makes sense to move this into the base package.

Comment on lines 82 to 83
%config /etc/systemd/system/nvidia-cdi-refresh.service
%config /etc/systemd/system/nvidia-cdi-refresh.path
Member

Does defining these as configs mean that a remove doesn't remove them?

Collaborator Author

Hmm, yes. I was following what I saw in the mig-parted install script, but I guess here we want a different behaviour.

@@ -29,3 +29,9 @@ Architecture: any
Depends: ${misc:Depends}, nvidia-container-toolkit-base (= @VERSION@)
Description: NVIDIA Container Toolkit Operator Extensions
Provides tools for using the NVIDIA Container Toolkit with the GPU Operator

Package: nvidia-container-toolkit-cdi-refresh
Member

As I called out in a previous comment, we don't have the same package in the RPM package spec. Could we discuss why these are not just part of the -base package?

Collaborator Author

Yeah, for consistency with RPM I have merged this into the base package.


[Unit]
Description=Refresh NVIDIA CDI specification file
ConditionPathExists=/usr/bin/nvidia-smi
Member

Should we also check for /usr/bin/nvidia-ctk?
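If adopted, that would presumably just be an additional condition in the unit, for example (sketch):

[Unit]
Description=Refresh NVIDIA CDI specification file
# Require both the driver tool and the CDI generator to be present (conditions are ANDed).
ConditionPathExists=/usr/bin/nvidia-smi
ConditionPathExists=/usr/bin/nvidia-ctk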

@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot May 20, 2025 15:09
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a new systemd service to refresh the NVIDIA CDI specification on driver installation events and updates packaging and Docker build files accordingly.

  • Introduces nvidia-cdi-refresh.service and nvidia-cdi-refresh.path for automatically triggering CDI refresh.
  • Updates RPM and Debian packaging scripts to install and enable the new service.
  • Modifies Dockerfiles for multiple distributions to include the new systemd files.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:
  • packaging/rpm/SPECS/nvidia-container-toolkit.spec – Adds systemd unit sources and installation steps for RPM.
  • packaging/debian/nvidia-container-toolkit-base.postinst – Enables the new path unit during post-installation.
  • packaging/debian/nvidia-container-toolkit-base.install – Installs the new systemd service and path files.
  • docker/Dockerfile.ubuntu – Copies systemd files into the Ubuntu Docker build.
  • docker/Dockerfile.rpm-yum – Copies systemd files into the RPM Docker build.
  • docker/Dockerfile.opensuse-leap – Copies systemd files into the openSUSE Docker build.
  • docker/Dockerfile.debian – Copies systemd files into the Debian Docker build.
  • deployments/systemd/nvidia-cdi-refresh.service – Defines the new systemd service for CDI refresh.
  • deployments/systemd/nvidia-cdi-refresh.path – Defines the path unit to trigger the CDI refresh service.

@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from ae30b74 to fb58ffc Compare May 20, 2025 15:24
@@ -71,6 +71,7 @@ RUN make PREFIX=${DIST_DIR} cmds

WORKDIR $DIST_DIR/..
COPY packaging/rpm .
COPY deployments/systemd/* ${DIST_DIR}/
Member

Is there a reason that we don't use:

Suggested change
COPY deployments/systemd/* ${DIST_DIR}/
COPY deployments/systemd/ ${DIST_DIR}/

Like we do for the other Dockerfiles?

Collaborator Author

done

@@ -89,6 +103,9 @@ Provides tools such as the NVIDIA Container Runtime and NVIDIA Container Toolkit
%{_bindir}/nvidia-container-runtime
%{_bindir}/nvidia-ctk
%{_bindir}/nvidia-cdi-hook
%dir /etc/systemd/system
Member

Surely this is not correct? This directory is not part of the package. Why do we not create this folder when we install the files?

Collaborator Author

yes, you are right, removed

Comment on lines 45 to 46
install -m 644 -t %{buildroot}/etc/systemd/system %{SOURCE7}
install -m 644 -t %{buildroot}/etc/systemd/system %{SOURCE8}
Member

Should /etc be %{_sysconfdir}?

Suggested change
install -m 644 -t %{buildroot}/etc/systemd/system %{SOURCE7}
install -m 644 -t %{buildroot}/etc/systemd/system %{SOURCE8}
install -m 644 -t %{buildroot}%{_sysconfdir}/systemd/system nvidia-cdi-refresh.service
install -m 644 -t %{buildroot}%{_sysconfdir}/systemd/system nvidia-cdi-refresh.path

(We're also not using the source reference for the other files.)

Collaborator Author

done

install -m 755 -t %{buildroot}%{_bindir} nvidia-container-runtime-hook
install -m 755 -t %{buildroot}%{_bindir} nvidia-container-runtime
install -m 755 -t %{buildroot}%{_bindir} nvidia-container-runtime.cdi
install -m 755 -t %{buildroot}%{_bindir} nvidia-container-runtime.legacy
install -m 755 -t %{buildroot}%{_bindir} nvidia-ctk
install -m 755 -t %{buildroot}%{_bindir} nvidia-cdi-hook
install -m 644 -t %{buildroot}/etc/systemd/system %{SOURCE7}
Member

Question: Do Debian packages allow for filemode controls too?

Collaborator Author

Actually, yes: https://www.debian.org/doc/debian-policy/ch-controlfields.html
I added a couple of new lines according to the documentation.

@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from 93584d6 to 4454142 Compare May 21, 2025 15:24
@elezar elezar added this to the v1.18.0 milestone May 27, 2025

%install
mkdir -p %{buildroot}%{_bindir}
mkdir -p %{buildroot}/etc/systemd/system/
Member

Should we use the %{_sysconfdir} macro for /etc instead?

Collaborator Author

agree

Comment on lines 54 to 63
# Reload systemd unit cache
/bin/systemctl daemon-reload || :

# On fresh install ($1 == 1) enable the path unit so it starts at boot
if [ "$1" -eq 1 ]; then
/bin/systemctl enable --now nvidia-cdi-refresh.path || :
fi
Member

For the Debian package we have additional checks for whether systemctl is a known application. Is this not required here?

Collaborator Author

done
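For reference, the kind of guard being referred to typically looks something like this sketch (the exact form in the merged scriptlet may differ):

# Only talk to systemd if systemctl is actually available.
if command -v systemctl >/dev/null 2>&1; then
    systemctl daemon-reload || :
    # On fresh install ($1 == 1) enable the path unit so it starts at boot.
    if [ "$1" -eq 1 ]; then
        systemctl enable --now nvidia-cdi-refresh.path || :
    fi
fi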

Comment on lines 106 to 107
/etc/systemd/system/nvidia-cdi-refresh.service
/etc/systemd/system/nvidia-cdi-refresh.path
Member

Suggested change
/etc/systemd/system/nvidia-cdi-refresh.service
/etc/systemd/system/nvidia-cdi-refresh.path
%{_sysconfdir}/systemd/system/nvidia-cdi-refresh.service
%{_sysconfdir}/systemd/system/nvidia-cdi-refresh.path

Collaborator Author

Hmm, you are right, edited.

coveralls commented May 28, 2025

Pull Request Test Coverage Report for Build 15293538731

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 33.149%

Totals:
  • Change from base Build 15277083860: 0.0%
  • Covered Lines: 4187
  • Relevant Lines: 12631

💛 - Coveralls

@Copilot Copilot AI left a comment

Pull Request Overview

Adds a systemd-based mechanism to automatically regenerate the NVIDIA CDI specification file when NVIDIA drivers are installed or upgraded, and updates packaging and Docker builds to include and enable this new service.

  • Introduces nvidia-cdi-refresh.service and .path units
  • Updates Dockerfiles to bundle the systemd units into each distribution build
  • Enhances RPM and Debian packages to install, reload, and enable the new units

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Summary per file:
  • packaging/rpm/SPECS/nvidia-container-toolkit.spec – Adds service/path sources, installs the units, and reloads/enables them in %post.
  • packaging/debian/rules – Overrides dh_fixperms to set correct permissions on the new units.
  • packaging/debian/nvidia-container-toolkit-base.postinst – Reloads systemd and enables the .path unit on first configure.
  • packaging/debian/nvidia-container-toolkit-base.install – Installs the service and path files under /etc/systemd/system/.
  • docker/Dockerfile.debian – Copies systemd deployment files into the Debian build context.
  • docker/Dockerfile.ubuntu – Copies systemd deployment files into the Ubuntu build context.
  • docker/Dockerfile.opensuse-leap – Copies systemd deployment files into the openSUSE build context.
  • docker/Dockerfile.rpm-yum – Copies systemd deployment files into the RPM/YUM build context.
  • deployments/systemd/nvidia-cdi-refresh.service – Defines the one-shot service to generate the CDI spec.
  • deployments/systemd/nvidia-cdi-refresh.path – Defines the path unit to trigger regeneration on driver changes.
Comments suppressed due to low confidence (2)

deployments/systemd/nvidia-cdi-refresh.path:19

  • %v is not a valid systemd specifier for the kernel version, so PathChanged may not fire. Use a literal path (e.g., /lib/modules/$(uname -r)/modules.dep), a wildcard glob, or a supported specifier (or consider PathExistsGlob) to monitor the correct file.
PathChanged=/lib/modules/%v/modules.dep

deployments/systemd/nvidia-cdi-refresh.service:22

  • The %v specifier is not recognized by systemd and will prevent locating modules.dep. Replace it with the actual kernel version (e.g., via wildcard or $(uname -r) substitution) or a valid systemd specifier.
ExecCondition=/usr/bin/grep -qE '/nvidia.ko' /lib/modules/%v/modules.dep

@ArangoGutierrez ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from 9b14e59 to fa05684 Compare May 28, 2025 06:06
Automatic regeneration of /var/run/cdi/nvidia.yaml

New units:
  • nvidia-cdi-refresh.service – one-shot wrapper for nvidia-ctk cdi generate (adds sleep + required caps).
  • nvidia-cdi-refresh.path – fires on driver install/upgrade via modules.dep.bin changes.

Packaging:
  • RPM %post reloads systemd and enables the path unit on fresh installs.
  • DEB postinst does the same (configure, skip on upgrade).

Result: the CDI spec is always up to date.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
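A quick way to sanity-check the result on a host with the package installed might be (illustrative commands; generating the spec manually may require root):

# Confirm the path unit is enabled and inspect the generated spec.
systemctl status nvidia-cdi-refresh.path
ls -l /var/run/cdi/nvidia.yaml
# Or regenerate it manually with the same command the service runs.
nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml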